Abstract

We compare the risk of ridge regression to that of a simple variant of ordinary least squares, in which
one projects the data onto a finite-dimensional subspace (as specified by a Principal
Component Analysis) and then performs an ordinary (unregularized) least squares regression
in this subspace. This note shows that the risk of this ordinary least squares method is within
a constant factor (namely 4) of the risk of ridge regression.



1 Introduction 

Consider the fixed design setting where we have a set of $n$ vectors $\mathcal{X} = \{X_i\}$, and let $X$ denote the
matrix whose $i$-th row is $X_i$. The observed label vector is $Y \in \mathbb{R}^n$. Suppose that:

$$Y = X\beta + \epsilon$$

where $\epsilon$ is independent noise in each coordinate, with the variance of $\epsilon_i$ being $\sigma^2$.
The objective is to learn $E[Y] = X\beta$. The expected loss of a vector $w$ is:

$$L(w) = \frac{1}{n} E_Y\left[\|Y - Xw\|^2\right]$$

Let $\hat\beta$ be an estimator of $\beta$ (constructed with a sample $Y$). Denoting

$$\Sigma := \frac{1}{n} X^\top X$$

we have that the risk (i.e. expected excess loss) is:

$$\mathrm{Risk}(\hat\beta) := E_{\hat\beta}\left[L(\hat\beta) - L(\beta)\right] = E_{\hat\beta}\|\hat\beta - \beta\|_\Sigma^2$$

where $\|x\|_\Sigma^2 = x^\top \Sigma x$ and where the expectation is with respect to the randomness in $Y$.

We show that a simple variant of ordinary (unregularized) least squares always compares favorably
to ridge regression (as measured by the risk). This observation is based on the following bias-variance
decomposition:

$$\mathrm{Risk}(\hat\beta) = \underbrace{E\|\hat\beta - \bar\beta\|_\Sigma^2}_{\text{Variance}} + \underbrace{\|\bar\beta - \beta\|_\Sigma^2}_{\text{Prediction Bias}} \qquad (1.1)$$

where $\bar\beta = E[\hat\beta]$.
1.1 The Risk of Ridge Regression



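Ridge regression, or Tikhonov regularization [Tikhonov, 1963], penalizes the $\ell_2$ norm of a parameter
vector $w$ and "shrinks" the estimate towards zero, penalizing large values more. The estimator is:

$$\hat\beta_\lambda = \operatorname{argmin}_w \left\{ \frac{1}{n}\|Y - Xw\|^2 + \lambda\|w\|^2 \right\}$$

The closed form estimate is then:

$$\hat\beta_\lambda = (\Sigma + \lambda I)^{-1}\left(\frac{1}{n} X^\top Y\right)$$

Note that

$$\hat\beta_0 = \hat\beta_{\lambda=0} = \operatorname{argmin}_w \left\{ \|Y - Xw\|^2 \right\}$$

is the ordinary least squares estimator.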
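Without loss of generality, rotate $X$ such that:

$$\Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$$

where the $\lambda_j$'s are in decreasing order.
To see the nature of this shrinkage, observe that:

$$[\hat\beta_\lambda]_j = \frac{\lambda_j}{\lambda_j + \lambda} [\hat\beta_0]_j$$

where $\hat\beta_0$ is the ordinary least squares estimator.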
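The shrinkage identity can be checked on a small example. The following is a minimal sketch, assuming the ridge objective is normalized by $1/n$ (so the closed form is $(\Sigma + \lambda I)^{-1} X^\top Y / n$) and that $X$ already has orthogonal columns with every $\lambda_j > 0$. The function name `ridge_and_ols_diagonal` and the toy data are illustrative, not from the original note.

```python
def ridge_and_ols_diagonal(X, Y, lam):
    """Coordinate-wise ridge and OLS estimates.

    Assumes the columns of X are orthogonal, so Sigma = X^T X / n is diagonal,
    and that every diagonal entry lam_j is strictly positive.
    """
    n, p = len(X), len(X[0])
    ridge, ols, eigs = [], [], []
    for j in range(p):
        lam_j = sum(X[i][j] ** 2 for i in range(n)) / n       # j-th diagonal entry of Sigma
        corr_j = sum(X[i][j] * Y[i] for i in range(n)) / n    # j-th entry of X^T Y / n
        eigs.append(lam_j)
        ols.append(corr_j / lam_j)             # closed form with lam = 0
        ridge.append(corr_j / (lam_j + lam))   # closed form with lam > 0
    return ridge, ols, eigs


# Toy design with orthogonal columns, so Sigma = diag(4, 1) (hypothetical data).
X = [[2.0, 1.0], [2.0, -1.0], [2.0, 1.0], [2.0, -1.0]]
Y = [1.0, 0.5, -2.0, 3.0]
lam = 0.7

ridge, ols, eigs = ridge_and_ols_diagonal(X, Y, lam)
# Shrink each OLS coordinate by lam_j / (lam_j + lam) and compare with ridge.
shrunk = [l / (l + lam) * b for l, b in zip(eigs, ols)]
```

Here `shrunk` and `ridge` agree up to floating-point rounding: directions with large $\lambda_j$ are barely shrunk, while directions with $\lambda_j \ll \lambda$ are shrunk almost to zero.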

Using the bias-variance decomposition (Equation 1.1), we have that:

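Lemma 1. We have:

$$\mathrm{Risk}(\hat\beta_\lambda) = \frac{\sigma^2}{n} \sum_j \left(\frac{\lambda_j}{\lambda_j + \lambda}\right)^2 + \sum_j \beta_j^2 \frac{\lambda_j}{(1 + \lambda_j/\lambda)^2}$$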

The proof is straightforward and provided in the appendix. 
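As a sanity check on Lemma 1, the sketch below evaluates the stated formula and compares it with the risk computed directly from the bias-variance decomposition for a coordinate-wise shrinkage estimator $c_j [\hat\beta_0]_j$. It assumes rotated coordinates with all $\lambda_j > 0$, in which case $[\hat\beta_0]_j$ has mean $\beta_j$ and variance $\sigma^2 / (n \lambda_j)$. The function names and the numbers are illustrative assumptions.

```python
def ridge_risk(eigs, beta, sigma2, n, lam):
    """Risk of ridge regression as stated in Lemma 1 (rotated coordinates)."""
    variance = sigma2 / n * sum((l / (l + lam)) ** 2 for l in eigs)
    # lam_j * (lam / (lam_j + lam))^2 equals lam_j / (1 + lam_j / lam)^2 for lam > 0
    bias = sum(b * b * l * (lam / (l + lam)) ** 2 for l, b in zip(eigs, beta))
    return variance + bias


def shrinkage_risk(eigs, beta, sigma2, n, factors):
    """Risk of the estimator factors[j] * [OLS]_j via the bias-variance decomposition."""
    # Variance term: lam_j * Var(c_j * [OLS]_j), with Var([OLS]_j) = sigma2 / (n * lam_j)
    variance = sum(l * c * c * sigma2 / (n * l) for l, c in zip(eigs, factors))
    # Bias term: lam_j * (c_j * beta_j - beta_j)^2
    bias = sum(l * (c - 1.0) ** 2 * b * b for l, c, b in zip(eigs, factors, beta))
    return variance + bias


# Hypothetical spectrum, parameter vector, noise level and sample size.
eigs = [4.0, 1.0, 0.25]
beta = [1.0, -2.0, 0.5]
sigma2, n, lam = 1.0, 50, 0.5

factors = [l / (l + lam) for l in eigs]  # ridge shrinkage factors
risk_lemma = ridge_risk(eigs, beta, sigma2, n, lam)
risk_direct = shrinkage_risk(eigs, beta, sigma2, n, factors)
```

The two values coincide up to rounding, which mirrors the two steps of the appendix proof: one sum for the variance and one for the bias.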

2 Ordinary Least Squares with PCA

Now let us construct a simple estimator based on $\lambda$. Note that our rotated coordinate system, where
$\Sigma$ is equal to $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$, corresponds to the PCA coordinate system.

Consider the following ordinary least squares estimator on the "top" PCA subspace: it uses the
least squares estimate on coordinate $j$ if $\lambda_j \ge \lambda$ and $0$ otherwise.

$$[\hat\beta_{PCA,\lambda}]_j = \begin{cases} [\hat\beta_0]_j & \text{if } \lambda_j \ge \lambda \\ 0 & \text{otherwise} \end{cases}$$

The following claim shows this estimator compares favorably to the ridge estimator (for every $\lambda$).
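In the rotated coordinates the estimator is a hard threshold on the spectrum. A minimal sketch, assuming the OLS coordinates and the eigenvalues $\lambda_j$ are already available as lists, and keeping coordinate $j$ when $\lambda_j$ is at least $\lambda$ (the name `pca_ols` and the numbers are hypothetical):

```python
def pca_ols(ols, eigs, lam):
    """Keep the OLS coordinate when lam_j >= lam, otherwise set it to 0."""
    return [b if l >= lam else 0.0 for b, l in zip(ols, eigs)]


# Hypothetical OLS estimate and spectrum in PCA coordinates.
ols = [0.9, -1.8, 0.4]
eigs = [4.0, 1.0, 0.25]

beta_pca = pca_ols(ols, eigs, lam=0.5)
# The third direction has lam_j = 0.25 < 0.5, so it is dropped.
```

Compared with ridge, which multiplies every coordinate by $\lambda_j / (\lambda_j + \lambda)$, this estimator uses a factor of exactly 1 or 0.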
Theorem 2.1. (Bounded Risk Inflation) For all $\lambda \ge 0$, we have that:

$$\mathrm{Risk}(\hat\beta_{PCA,\lambda}) \le 4\, \mathrm{Risk}(\hat\beta_\lambda)$$
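The bound can be probed numerically before reading the proof. The sketch below draws random spectra and parameters (all ranges are arbitrary assumptions) and records the largest observed ratio of the two risks. The PCA risk is computed from the bias-variance decomposition: a variance of $\sigma^2/n$ for each retained coordinate plus a bias of $\lambda_j \beta_j^2$ for each dropped one. The ridge risk is the formula of Lemma 1. This is an illustration, not a substitute for the proof.

```python
import random


def ridge_risk(eigs, beta, sigma2, n, lam):
    """Risk of ridge regression as stated in Lemma 1 (rotated coordinates)."""
    variance = sigma2 / n * sum((l / (l + lam)) ** 2 for l in eigs)
    bias = sum(b * b * l * (lam / (l + lam)) ** 2 for l, b in zip(eigs, beta))
    return variance + bias


def pca_ols_risk(eigs, beta, sigma2, n, lam):
    """Risk of OLS restricted to the coordinates with lam_j >= lam."""
    kept = sum(1 for l in eigs if l >= lam)                         # variance: sigma2 / n each
    bias = sum(l * b * b for l, b in zip(eigs, beta) if l < lam)    # dropped coordinates
    return sigma2 / n * kept + bias


random.seed(0)
worst = 0.0
for _ in range(2000):
    p = random.randint(1, 6)
    eigs = [random.uniform(0.01, 5.0) for _ in range(p)]
    beta = [random.uniform(-3.0, 3.0) for _ in range(p)]
    sigma2 = random.uniform(0.1, 2.0)
    n = random.randint(p, 200)
    lam = random.uniform(0.01, 5.0)
    ratio = pca_ols_risk(eigs, beta, sigma2, n, lam) / ridge_risk(eigs, beta, sigma2, n, lam)
    worst = max(worst, ratio)

# A single coordinate with lam_j == lam and beta_j == 0 attains the factor 4.
tight = pca_ols_risk([1.0], [0.0], 1.0, 10, 1.0) / ridge_risk([1.0], [0.0], 1.0, 10, 1.0)
```

The last line shows that the constant cannot be improved in general: when $\lambda_j = \lambda$ and $\beta_j = 0$, ridge halves the coordinate and so has a quarter of the variance, with no bias to pay for it.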



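Proof. Using the bias-variance decomposition of the risk, we can write the risk as:

$$\mathrm{Risk}(\hat\beta_{PCA,\lambda}) = \frac{\sigma^2}{n} \sum_j \mathbb{1}_{\lambda_j \ge \lambda} + \sum_{j : \lambda_j < \lambda} \lambda_j \beta_j^2$$

The first term represents the variance and the second the bias.

The ridge regression risk is given by Lemma 1. We now show that the $j$-th term in the expression
for the PCA risk is within a factor 4 of the $j$-th term of the ridge regression risk. First, let us consider
the case when $\lambda_j \ge \lambda$; then the ratio of the $j$-th terms is:

$$\frac{\frac{\sigma^2}{n}}{\frac{\sigma^2}{n}\left(\frac{\lambda_j}{\lambda_j + \lambda}\right)^2 + \beta_j^2 \frac{\lambda_j}{(1 + \lambda_j/\lambda)^2}} \le \frac{\frac{\sigma^2}{n}}{\frac{\sigma^2}{n}\left(\frac{\lambda_j}{\lambda_j + \lambda}\right)^2} = \left(1 + \frac{\lambda}{\lambda_j}\right)^2 \le 4$$

Similarly, if $\lambda_j < \lambda$, the ratio of the $j$-th terms is:

$$\frac{\lambda_j \beta_j^2}{\frac{\sigma^2}{n}\left(\frac{\lambda_j}{\lambda_j + \lambda}\right)^2 + \beta_j^2 \frac{\lambda_j}{(1 + \lambda_j/\lambda)^2}} \le \frac{\lambda_j \beta_j^2}{\beta_j^2 \frac{\lambda_j}{(1 + \lambda_j/\lambda)^2}} = \left(1 + \frac{\lambda_j}{\lambda}\right)^2 \le 4$$

Since each term is within a factor of 4, the proof is complete. □

3 Conclusion

We showed that the risk of a particular ordinary least squares estimator (on the "top"
PCA subspace) is within a factor 4 of the risk of the ridge estimator.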
Appendix



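Proof. We analyze the bias-variance decomposition in Equation 1.1. For the variance,

$$\begin{aligned}
E_Y \|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2
&= \sum_j \lambda_j E_Y\left([\hat\beta_\lambda]_j - [\bar\beta_\lambda]_j\right)^2 \\
&= \sum_j \frac{\lambda_j}{(\lambda_j + \lambda)^2} \frac{1}{n^2} E\left[\left(\sum_{i=1}^n (Y_i - E[Y_i]) X_{ij}\right)^2\right] \\
&= \sum_j \frac{\lambda_j}{(\lambda_j + \lambda)^2} \frac{\sigma^2}{n^2} \sum_{i=1}^n X_{ij}^2 \\
&= \frac{\sigma^2}{n} \sum_j \frac{\lambda_j^2}{(\lambda_j + \lambda)^2}
\end{aligned}$$

where the last step uses $\sum_{i=1}^n X_{ij}^2 = n \lambda_j$. Similarly, for the bias,

$$\begin{aligned}
\|\bar\beta_\lambda - \beta\|_\Sigma^2
&= \sum_j \lambda_j \left([\bar\beta_\lambda]_j - \beta_j\right)^2 \\
&= \sum_j \beta_j^2 \lambda_j \left(\frac{\lambda_j}{\lambda_j + \lambda} - 1\right)^2 \\
&= \sum_j \beta_j^2 \frac{\lambda_j}{(1 + \lambda_j/\lambda)^2}
\end{aligned}$$

which completes the proof. □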